
Conversation

@anmyachev (Contributor) commented Dec 6, 2024

…our patch for elapsed_time

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

@chengjunlu there are many messages like "warning: Double arithmetic operation is not supported on this platform with FP64 conversion emulation mode (poison FP64 kernels is enabled)." in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158/job/34024201713. They really clutter the logs without providing much information. It would be great to suppress them somehow. Is this possible?

@anmyachev (Contributor Author)

@pbchekin do you know why? (from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158)

FAILED [4.7235s] inductor/test_triton_kernels.py::CustomOpTests::test_autotune_unbacked - torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/intel/oneapi/bin/icpx'

@pbchekin (Contributor) commented Dec 6, 2024

> @pbchekin do you know why? (from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158)
>
> FAILED [4.7235s] inductor/test_triton_kernels.py::CustomOpTests::test_autotune_unbacked - torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
> FileNotFoundError: [Errno 2] No such file or directory: '/opt/intel/oneapi/bin/icpx'

There is no such file, indeed. This file is located at:

  • /opt/intel/oneapi/compiler/2024.1/bin/icpx, /opt/intel/oneapi/pytorch-gpu-dev-0.5/bin/icpx in PTDB
  • /opt/intel/oneapi/2025.0/bin/icpx, /opt/intel/oneapi/compiler/2025.0/bin/icpx in DLE

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

> > @pbchekin do you know why? (from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158)
> >
> > FAILED [4.7235s] inductor/test_triton_kernels.py::CustomOpTests::test_autotune_unbacked - torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
> > FileNotFoundError: [Errno 2] No such file or directory: '/opt/intel/oneapi/bin/icpx'
>
> There is no such file, indeed. This file is located at:
>
>   • /opt/intel/oneapi/compiler/2024.1/bin/icpx, /opt/intel/oneapi/pytorch-gpu-dev-0.5/bin/icpx in PTDB
>   • /opt/intel/oneapi/2025.0/bin/icpx, /opt/intel/oneapi/compiler/2025.0/bin/icpx in DLE

I found the reason: it's because of how PyTorch searches for the SYCL home directory, and it probably relates to pytorch/pytorch@4742080.

Ref to PyTorch: https://github.com/pytorch/pytorch/blame/5872a8c6b00a5c9e45ac4bc99a5c87b93a93aa94/torch/utils/cpp_extension.py#L147

def _find_sycl_home() -> Optional[str]:
    """Find the OneAPI install path."""
    # Guess #1
    sycl_home = os.environ.get('ONEAPI_ROOT')
    if sycl_home is None:
        # Guess #2
        icpx_path = shutil.which('icpx')
        if icpx_path is not None:
            sycl_home = os.path.dirname(os.path.dirname(
                os.path.realpath(icpx_path)))

    if sycl_home and not torch.xpu.is_available():
        print(f"No XPU runtime is found, using ONEAPI_ROOT='{sycl_home}'",
              file=sys.stderr)
    return sycl_home
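
As an aside, Guess #2 above can be exercised on its own; a minimal sketch (assuming the oneAPI environment has been activated so that icpx is on PATH):

import os
import shutil

# Mirrors "Guess #2" from _find_sycl_home: resolve icpx on PATH and walk
# two directories up to infer the oneAPI home. Assumes the oneAPI
# environment was activated beforehand (e.g. via setvars.sh).
icpx_path = shutil.which("icpx")
if icpx_path is not None:
    sycl_home = os.path.dirname(os.path.dirname(os.path.realpath(icpx_path)))
    print(f"icpx: {icpx_path} -> inferred home: {sycl_home}")
else:
    print("icpx not found on PATH; ONEAPI_ROOT would be the only hint")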

@anmyachev (Contributor Author)

@pbchekin any chance that ONEAPI_ROOT for PTDB is /opt/intel/oneapi/2025.0 instead of /opt/intel/oneapi (as it is for DLE)? That would explain the error.

My guess comes from pytorch/pytorch#142242 (comment).

@pbchekin (Contributor) commented Dec 7, 2024

> @pbchekin any chance that ONEAPI_ROOT for PTDB is /opt/intel/oneapi/2025.0 instead of /opt/intel/oneapi (as it is for DLE)? That would explain the error.
>
> My guess comes from pytorch/pytorch#142242 (comment).

Nope:

$ source /opt/intel/oneapi/setvars.sh
...
$ echo $ONEAPI_ROOT
/opt/intel/oneapi
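
This lines up with the traceback: a hypothetical sketch (the caller logic below is illustrative, not taken from PyTorch) of how joining this ONEAPI_ROOT with bin/icpx yields exactly the missing path, even though the binary actually lives in a versioned subdirectory:

import os

# Hypothetical illustration: joining ONEAPI_ROOT (as reported above) with
# "bin/icpx" reproduces the path from the FileNotFoundError, while the
# real binary sits under e.g. /opt/intel/oneapi/compiler/2025.0/bin/icpx.
oneapi_root = "/opt/intel/oneapi"
print(os.path.join(oneapi_root, "bin", "icpx"))  # /opt/intel/oneapi/bin/icpx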

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

Inductor CI with changes from PR 2962: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12221768210

@chengjunlu (Contributor)

> @chengjunlu there are many messages like "warning: Double arithmetic operation is not supported on this platform with FP64 conversion emulation mode (poison FP64 kernels is enabled)." in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158/job/34024201713. They really clutter the logs without providing much information. It would be great to suppress them somehow. Is this possible?

This warning is a DPC++ feature. Let me check with the torch team how to disable it.

@anmyachev (Contributor Author) commented Dec 9, 2024

Hi @guangyey!

After this change (pytorch/pytorch#135567), our tutorials started to fail with a RuntimeError: Overflow when unpacking long exception (ref: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12239696322/job/34140774158?pr=2952).

A quick search through the PyTorch codebase gave me the idea that the problem is in the infer_scalar_type function (https://github.com/pytorch/pytorch/blob/90fc2b42e3e2d51b26a96df0dff4a644e218f8ab/torch/csrc/utils/tensor_new.cpp#L148), which returns ScalarType::Long instead of ScalarType::Uint64. Could you take a look?

Example:

# Each pointer is obtained through the tensor.data_ptr() method.
d_a_ptrs = torch.tensor([18374686479673720832, 18374967954644140032, 18374967954645188608, 18374967954645450752], device=device) # <- failed

Stack trace:

#0  0x00007fffed0824a1 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fffec20b246 in THPUtils_unpackLong(_object*) () from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#2  0x00007fffec9b7911 in torch::utils::store_scalar(void*, c10::ScalarType, _object*) ()
   from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#3  0x00007fffec9c1af8 in torch::utils::(anonymous namespace)::recursive_store(char*, c10::ArrayRef<long>, c10::ArrayRef<long>, long, c10::ScalarType, unsigned long, _object*) [clone .isra.0] () from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#4  0x00007fffec9c3460 in torch::utils::(anonymous namespace)::internal_new_from_data(c10::TensorOptions, c10::ScalarType, std::optional<c10::Device>, _object*, bool, bool, bool, bool) () from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#5  0x00007fffec9c8d77 in torch::utils::tensor_ctor(c10::DispatchKey, c10::ScalarType, torch::PythonArgs&) ()
   from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#6  0x00007fffec4ed8f2 in torch::autograd::THPVariable_tensor(_object*, _object*, _object*) ()
   from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#7  0x00005555556985a6 in cfunction_call (func=0x7ffff7631800, args=<optimized out>, kwargs=<optimized out>)

Simplified example:

import torch

test = torch.rand((10, 10), device="xpu", dtype=torch.float16)
test_ptr = test.data_ptr()
torch.tensor(test_ptr, device="xpu")  # <- RuntimeError: Overflow when unpacking long

@anmyachev (Contributor Author)

I decided to try specifying the dtype directly, as suggested in pytorch/pytorch#135628. @guangyey, do I understand correctly that this approach is now the recommended one in the code?

anmyachev marked this pull request as ready for review on December 9, 2024 19:47
@guangyey commented Dec 10, 2024

> I decided to try specifying the dtype directly, as suggested in pytorch/pytorch#135628. @guangyey, do I understand correctly that this approach is now the recommended one in the code?

In my understanding, specifying the dtype is the correct approach. This is because we changed data_ptr from int64 to uint64 in pytorch/pytorch#135567, while the default integer tensor dtype is int64. So users need to be aware that they should specify dtype=torch.uint64 if they pass a value that overflows int64.

>>> b = torch.tensor([1,2])
>>> b.dtype
torch.int64
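
Accordingly, a minimal sketch of the suggested fix applied to the repro above (assuming an XPU device is available and a PyTorch build that supports torch.uint64):

import torch

test = torch.rand((10, 10), device="xpu", dtype=torch.float16)
test_ptr = test.data_ptr()
# Explicit unsigned dtype: data_ptr() can exceed the int64 range, so the
# default int64 inference would raise "Overflow when unpacking long".
ptr_tensor = torch.tensor(test_ptr, dtype=torch.uint64, device="xpu")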

@guangyey left a comment

LGTM.

anmyachev merged commit 3ccab57 into main on Dec 10, 2024
5 checks passed
anmyachev deleted the amyachev/issue2945 branch on December 10, 2024 11:29

Development

Successfully merging this pull request may close these issues.

[Pytorch pin update] Update PyTorch pin and deprecate elapsed_time patch